The SVG/PDF below is the result of modifying the figure in Inkscape.
The SENIC data set describes the results of measurements taken at different US hospitals. The variables are described in the accompanying document SENIC.pdf.
The data is loaded and the first 6 rows are shown below.
##   ID    X1   X2  X3   X4    X5  X6 X7 X8  X9 X10 X11
## 1  1  7.13 55.7 4.1  9.0  39.6 279  2  4 207 241  60
## 2  2  8.82 58.2 1.6  3.8  51.7  80  2  2  51  52  40
## 3  3  8.34 56.9 2.7  8.1  74.0 107  2  3  82  54  20
## 4  4  8.95 53.7 5.6 18.9 122.8 147  2  4  53 148  40
## 5  5 11.20 56.5 5.7 34.5  88.9 180  2  1 134 151  40
## 6  6  9.76 50.9 5.1 21.9  97.0 150  2  2 147 106  40
The function declared below satisfies both requirements.
outliers = function(vec){
  # Stop condition for the outliers function.
  # && is used because the condition must be a single logical value.
  if (!(is.vector(vec) && is.numeric(vec))){
    stop("Either the object is not a vector or is not numeric.")
  }
  # Getting the length of the vector in
  # order to construct the indices.
  n = length(vec)
  idxs = seq_len(n) # Indices.
  # Getting the quartiles of the vector
  # to calculate outliers.
  quantiles = quantile(vec)
  q1 = quantiles[[2]] # First quartile.
  q3 = quantiles[[4]] # Third quartile.
  # Constructing the boolean mask that
  # is going to be used to construct the
  # indices.
  mask = (vec > q3 + 1.5 * (q3 - q1)) | (vec < q1 - 1.5 * (q3 - q1))
  idxs = idxs[mask]
  return(idxs)
}
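As a quick sanity check of the 1.5 × IQR rule used by this function, here is a minimal sketch of the same idea written with `which()`. The name `outlier_idx` and the toy vector are illustrative assumptions, not part of the assignment.

```r
# Hypothetical compact variant of the 1.5 * IQR rule (illustration only).
outlier_idx = function(vec) {
  stopifnot(is.vector(vec), is.numeric(vec))
  # First and third quartiles (default quantile type, as in quantile(vec)).
  q = quantile(vec, probs = c(0.25, 0.75), names = FALSE)
  iqr = q[2] - q[1]
  # which() turns the logical mask into indices.
  which(vec > q[2] + 1.5 * iqr | vec < q[1] - 1.5 * iqr)
}

# Toy data: the quartiles are 2.25 and 4.75, so the fences are -1.5 and 8.5.
x = c(1, 2, 3, 4, 5, 100)
print(outlier_idx(x))  # index 6, the position of the value 100
```

Returning indices rather than values keeps the result usable both for subsetting the vector and for looking up the corresponding rows in the data frame.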
We can see that the Infection Risk variable has 5 outliers (assuming that no markers overlap). These outliers are responsible for the odd tails of the distribution. If we ignored them, the kernel density estimate (KDE) would probably have a different shape.
This plot shows more information than what is encoded in the plot from step 4. In this case we can see what looks like a cubic relationship between Infection Risk and the Number of Nurses. The spread of the points grows as the Infection Risk increases, so the relationship, even though it exists, is not very clear. We can also argue that there is a limit somewhere around 200 nurses, beyond which extra nurses won't reduce the infection risk. As for the Number of Beds, it appears to be related to the Number of Nurses, since the colour of the points visibly changes around the 'limit' found for the nurses.
The danger of using such a color scale is that it is heavily affected by outliers. If, for example, most hospitals have around 200 ± 20 beds and there is one big hospital with 800 beds, the scale will make it look like most hospitals have the same number of beds, since they will all get a similar colour, and visual information is lost. Another danger of this particular colour scale is that low-valued points are hard to distinguish when they are close to high-valued ones, as happens, for example, on the right side of the graph.
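The compression effect can be checked numerically. The sketch below uses made-up bed counts (an assumption mirroring the example in the text, not the SENIC values) and computes where each hospital falls on a linear color scale.

```r
# Hypothetical bed counts: five hospitals near 200 beds and one very large one.
beds = c(180, 190, 200, 210, 220, 800)

# Position of each value on a linear color scale, from 0 (lowest) to 1 (highest).
scale_pos = (beds - min(beds)) / (max(beds) - min(beds))
print(round(scale_pos, 3))

# Share of the color range actually used by the five typical hospitals.
typical_share = max(scale_pos[beds < 800])
print(round(typical_share, 3))
```

With these numbers the five typical hospitals occupy less than a tenth of the color range, which is why they would look almost identical in the plot.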
We gained the ability to hover over the density and the outlier markers and see the values they correspond to. We also gained the ability to zoom in and out of the graph, so we can verify visually that there are only 5 outliers in the data.
Comment on how the graphs change with varying bandwidth and which bandwidth value is optimal from your point of view.
The bandwidth parameter controls how smooth the kernel density estimate (KDE) is. The lower the value, the more faithful the plot is to the data, but the harder it becomes to spot patterns. In this case, we consider 0.38 a reasonable compromise between smoothing and fidelity for the graph.
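The trade-off can be illustrated outside the app with base R's `density()`, where `bw` is the standard deviation of the smoothing kernel. The sketch below uses a simulated sample (an assumption, not the SENIC data) and a hypothetical helper `count_modes` that counts the local maxima of the estimate as a rough measure of how wiggly it is.

```r
set.seed(1)
# Simulated sample (an assumption; not the SENIC data).
x = rnorm(200)

# Count the local maxima of a kernel density estimate for a given bandwidth.
count_modes = function(x, bw) {
  d = density(x, bw = bw)
  # A local maximum is where the slope changes from positive to negative.
  sum(diff(sign(diff(d$y))) == -2)
}

modes_small = count_modes(x, bw = 0.1)  # little smoothing
modes_large = count_modes(x, bw = 1)    # heavy smoothing
print(c(small_bw = modes_small, large_bw = modes_large))
```

A small bandwidth typically produces more bumps, many of which reflect individual observations rather than structure in the data, while a large one can smooth away genuine features.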
# TASK 8
library(shiny)
# Creating a vector for the names
# of the variables so it's easier
# for the user to read.
feature_names = c("Length of Stay",
                  "Age",
                  "Infection Risk",
                  "Routine Culturing Ratio",
                  "Routine Chest X-ray Ratio",
                  "Number of Beds",
                  "Medical School Affiliation",
                  "Region",
                  "Average Daily Census",
                  "Number of Nurses",
                  "Available Facilities & Services")
checkbox_list = list()
for (i in 1:length(feature_names)){
checkbox_list[[i]] = checkboxInput(df_names[(i + 1)], feature_names[i], FALSE)
}
# Creating the UI for the shiny app.
ui = fluidPage(
sliderInput(inputId="ws", label="Choose bandwidth size", value=1, min=0.1, max=1),
checkbox_list,
plotOutput("densPlot")
)
# Server side functions.
server = function(input, output) {
output$densPlot <- renderPlot({
graphs = list()
counter = 1
for (name in df_names[2:length(df_names)]){
if (isTRUE(input[[name]]))
{
graphs[[counter]] = density_outliers(df[, name], name, bw=input$ws)
counter = counter + 1
}
}
if(length(graphs)>0){
g = grid.arrange(grobs=graphs)
g
}
})
}
# Run the application
shinyApp(ui = ui, server = server)
# TASK 1
# Importing the data that we are going to use
# and taking a look at it so that we know it
# was imported properly.
df = read.table("SENIC.txt")
head(df)
# Changing the names of the columns.
df_names = c("ID", paste0("X", 1:11))
names(df) = df_names
head(df)
# TASK 2
outliers = function(vec){
  # Stop condition for the outliers function.
  # && is used because the condition must be a single logical value.
  if (!(is.vector(vec) && is.numeric(vec))){
    stop("Either the object is not a vector or is not numeric.")
  }
  # Getting the length of the vector in
  # order to construct the indices.
  n = length(vec)
  idxs = seq_len(n) # Indices.
  # Getting the quartiles of the vector
  # to calculate outliers.
  quantiles = quantile(vec)
  q1 = quantiles[[2]] # First quartile.
  q3 = quantiles[[4]] # Third quartile.
  # Constructing the boolean mask that
  # is going to be used to construct the
  # indices.
  mask = (vec > q3 + 1.5 * (q3 - q1)) | (vec < q1 - 1.5 * (q3 - q1))
  idxs = idxs[mask]
  return(idxs)
}
# TASK 3
library(ggplot2)
outliers_idxs = outliers(df[, "X3"])
Y = rep(0, length(outliers_idxs))
X = df[outliers_idxs, "X3"]
g = ggplot() + geom_point(aes(x=X, y=Y), shape=5, size=5) + geom_density(aes(df$X3)) + xlab("X3")
print(g)
# TASK 4
library("grid")
library("gridExtra")
density_outliers = function(vec, name, bw="nrd0")
{
outliers_idxs = outliers(vec)
Y = rep(0, length(outliers_idxs))
X = vec[outliers_idxs]
g = ggplot() + stat_density(aes(vec), bw=bw) + geom_point(aes(x=X, y=Y), shape=5, size=3) + xlab(name)
return(g)
}
graphs = list()
counter = 1
for (name in df_names[2:length(df_names)]){
graphs[[counter]] = density_outliers(df[, name], name)
counter = counter + 1
}
grid.arrange(grobs=graphs, ncol=4)
# TASK 5
# Mapping color inside aes() so that ggplot2 builds a continuous
# color scale for the Number of Beds (X6).
g = ggplot(df, aes(x=X3, y=X10, color=X6)) + geom_point(shape=1, size=3)
g
# TASK 6
library("plotly")
ggplotly(g)
# TASK 7
p = plot_ly() %>%
add_histogram(x = df[, "X3"]) %>%
add_trace(x=X, y=Y, mode="markers", type="scatter", marker=list(symbol="diamond", size=10)) %>%
layout(bargap = 0.05)
p
# TASK 8
library(shiny)
# Creating a vector for the names
# of the variables so it's easier
# for the user to read.
feature_names = c("Length of Stay",
                  "Age",
                  "Infection Risk",
                  "Routine Culturing Ratio",
                  "Routine Chest X-ray Ratio",
                  "Number of Beds",
                  "Medical School Affiliation",
                  "Region",
                  "Average Daily Census",
                  "Number of Nurses",
                  "Available Facilities & Services")
checkbox_list = list()
for (i in 1:length(feature_names)){
checkbox_list[[i]] = checkboxInput(df_names[(i + 1)], feature_names[i], FALSE)
}
# Creating the UI for the shiny app.
ui = fluidPage(
sliderInput(inputId="ws", label="Choose bandwidth size", value=1, min=0.1, max=1),
checkbox_list,
plotOutput("densPlot")
)
# Server side functions.
server = function(input, output) {
output$densPlot <- renderPlot({
graphs = list()
counter = 1
for (name in df_names[2:length(df_names)]){
if (isTRUE(input[[name]]))
{
graphs[[counter]] = density_outliers(df[, name], name, bw=input$ws)
counter = counter + 1
}
}
if(length(graphs)>0){
g = grid.arrange(grobs=graphs)
g
}
})
}
# Run the application
shinyApp(ui = ui, server = server)